Round 1: Technical
✅ Tell me about your work experience and any recent projects you were part of.
✅ Questions related to client and cluster deploy modes.
✅ Explain the spark-submit command and the configurations you applied in your project.
✅ How can you optimize Spark code for better performance?
✅ How do you handle data skewness in Spark?
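One common mitigation is key salting; below is a minimal sketch where the table paths, the skewed join key customer_id, and the salt count are all illustrative assumptions. On Spark 3.x, enabling adaptive query execution (spark.sql.adaptive.skewJoin.enabled=true) is usually worth trying before hand-rolling this.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("skew-salting-sketch").getOrCreate()

num_salts = 8                                     # tune to the observed skew
fact = spark.read.parquet("/data/fact_orders")    # hypothetical large, skewed table
dim = spark.read.parquet("/data/dim_customers")   # hypothetical smaller table

# Add a random salt bucket to the skewed side so one hot key is split across partitions.
salted_fact = fact.withColumn("salt", (F.rand() * num_salts).cast("int"))

# Explode the other side so every salt bucket can still find its matching rows.
salted_dim = dim.withColumn("salt", F.explode(F.array([F.lit(i) for i in range(num_salts)])))

# Join on the original key plus the salt, then drop the helper column.
joined = salted_fact.join(salted_dim, ["customer_id", "salt"], "inner").drop("salt")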
✅ How would you handle a situation where the number of columns in your data source keeps on increasing or decreasing?
✅ Write Spark code to process such data, assuming the source files are in JSON format and stored in Azure Data Lake Storage (ADLS).
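A minimal sketch of one approach, assuming an ADLS Gen2 source and a Delta Lake target (the storage account, containers, and paths are placeholders): let Spark infer the JSON schema on read, and let the Delta table absorb new columns on write.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-schema-drift").getOrCreate()

source_path = "abfss://raw@<storage_account>.dfs.core.windows.net/events/"          # placeholder
target_path = "abfss://curated@<storage_account>.dfs.core.windows.net/events_delta/"  # placeholder

# Schema inference picks up newly added JSON fields automatically; removed fields
# simply come through as nulls for the affected files.
df = spark.read.option("multiLine", "true").json(source_path)

# mergeSchema lets the Delta target evolve instead of failing when new columns appear.
(df.write
   .format("delta")
   .mode("append")
   .option("mergeSchema", "true")
   .save(target_path))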
✅ 2-3 scenario-based questions related to Azure Data Factory (ADF).
✅ How would you implement Slowly Changing Dimension (SCD) Type 2 in your organization?
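A hedged sketch of one way to do SCD Type 2 on Delta Lake, assuming a dim_customer table with is_current/start_date/end_date columns and that only the address attribute is tracked for change; all table names, paths, and columns are illustrative.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("scd2-sketch").getOrCreate()

updates = spark.read.parquet("/staging/customers")   # hypothetical incoming batch
dim_path = "/warehouse/dim_customer"                 # hypothetical Delta SCD2 table
dim = DeltaTable.forPath(spark, dim_path)

# 1. Keep only incoming rows that are brand new or whose tracked attribute changed.
current = spark.read.format("delta").load(dim_path).filter("is_current = true")
changed = (updates.alias("s")
           .join(current.alias("t"), F.col("s.customer_id") == F.col("t.customer_id"), "left")
           .filter(F.col("t.customer_id").isNull() | (F.col("s.address") != F.col("t.address")))
           .select("s.*"))

# 2. Expire the existing current versions of those keys.
(dim.alias("t")
    .merge(changed.alias("s"), "t.customer_id = s.customer_id AND t.is_current = true")
    .whenMatchedUpdate(set={"is_current": "false", "end_date": "current_date()"})
    .execute())

# 3. Append the new versions as the current rows.
(changed.withColumn("start_date", F.current_date())
        .withColumn("end_date", F.lit(None).cast("date"))
        .withColumn("is_current", F.lit(True))
        .write.format("delta").mode("append").save(dim_path))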
✅ One scenario was given where you had to identify:
   - The type of schema used.
   - The tables and relationships between them.
   - The constraints applied to those tables.
✅ What is one key difference between RDDs, DataFrames, and Datasets in Spark?
✅ What is the difference between data lake storage, data warehouses, and Delta Lake?
✅ What is the difference between temp views and temp tables, and where is each used?
✅ Explain SQL triggers and how SQL execution works.
✅ Write Spark code to create a DataFrame from a CSV file where the delimiter is | instead of a comma.
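A minimal sketch (the file path and header option are assumptions):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pipe-delimited-csv").getOrCreate()

df = (spark.read
      .option("header", "true")           # assuming the first row carries column names
      .option("inferSchema", "true")
      .option("delimiter", "|")           # the key change: pipe instead of comma
      .csv("/data/input/employees.csv"))  # hypothetical path

df.show()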
✅ Write Spark code to find the highest salary for each department using the following dataset (a sample sketch follows the dataset):
data = [
[1001, 'Marlania', 92643, 1],
[1002, 'Briana', 87202, 1],
[1003, 'Maysha', 70545, 1],
[1004, 'Jamacia', 65285, 1],
[1005, 'Kimberli', 51407, 2],
[1006, 'Lakken', 88933, 2],
[1007, 'Micaila', 82145, 2],
[1008, 'Gion', 66187, 2],
[1009, 'Latoynia', 55729, 3],
[1010, 'Shaquria', 52111, 3],
[1011, 'Tarvares', 82979, 3],
[1012, 'Gabriella', 74132, 4],
[1013, 'Medusa', 72551, 4],
[1014, 'Kubra', 55170, 4]
]
columns = ['emp_id', 'emp_name', 'salary', 'emp_dep_id']
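One possible sketch, reusing the data and columns lists above: rank salaries within each department with a window and keep the top row per emp_dep_id.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("highest-salary-per-dept").getOrCreate()
df = spark.createDataFrame(data, columns)

# Rank employees within each department by salary, highest first.
w = Window.partitionBy("emp_dep_id").orderBy(F.col("salary").desc())
highest = (df.withColumn("rnk", F.dense_rank().over(w))
             .filter("rnk = 1")
             .drop("rnk"))
highest.show()

# If only the amount per department is needed (not the employee row):
df.groupBy("emp_dep_id").agg(F.max("salary").alias("max_salary")).show()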
✅ Write Spark code to calculate the average price for products based on the following tables (see the sketch after the expected output):
Prices Table:
+------------+------------+------------+-------+
| product_id | start_date | end_date   | price |
+------------+------------+------------+-------+
| 1          | 2019-02-17 | 2019-02-28 | 5     |
| 1          | 2019-03-01 | 2019-03-22 | 20    |
| 2          | 2019-02-01 | 2019-02-20 | 15    |
| 2          | 2019-02-21 | 2019-03-31 | 30    |
+------------+------------+------------+-------+
Units Sold Table:
+------------+---------------+-------+
| product_id | purchase_date | units |
+------------+---------------+-------+
| 1          | 2019-02-25    | 100   |
| 1          | 2019-03-01    | 15    |
| 2          | 2019-02-10    | 200   |
| 2          | 2019-03-22    | 30    |
+------------+---------------+-------+
Expected Output:
+------------+---------------+
| product_id | average_price |
+------------+---------------+
| 1          | 6.96          |
| 2          | 16.96         |
+------------+---------------+
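A possible sketch for the question above: join each sale to the price band whose date range covers the purchase date, then take the units-weighted average, rounded to two decimals.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("avg-product-price").getOrCreate()

prices = (spark.createDataFrame(
    [(1, "2019-02-17", "2019-02-28", 5),
     (1, "2019-03-01", "2019-03-22", 20),
     (2, "2019-02-01", "2019-02-20", 15),
     (2, "2019-02-21", "2019-03-31", 30)],
    ["product_id", "start_date", "end_date", "price"])
    .withColumn("start_date", F.to_date("start_date"))
    .withColumn("end_date", F.to_date("end_date")))

units_sold = (spark.createDataFrame(
    [(1, "2019-02-25", 100), (1, "2019-03-01", 15),
     (2, "2019-02-10", 200), (2, "2019-03-22", 30)],
    ["product_id", "purchase_date", "units"])
    .withColumn("purchase_date", F.to_date("purchase_date")))

# Each sale picks up the price that was valid on its purchase date.
joined = (units_sold.join(
              prices,
              (units_sold.product_id == prices.product_id)
              & (units_sold.purchase_date >= prices.start_date)
              & (units_sold.purchase_date <= prices.end_date))
          .select(units_sold.product_id, "units", "price"))

# average_price = sum(price * units) / sum(units), rounded to two decimals.
result = (joined.groupBy("product_id")
          .agg(F.round(F.sum(F.col("price") * F.col("units")) / F.sum("units"), 2)
               .alias("average_price")))
result.show()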
✅ Write Python code to merge two strings alternately (a sample solution follows the examples). For example:
Input:
word1 = "abc"
word2 = "pqr"
Output:
"apbqcr"
Input:
word1 = "ab"
word2 = "pqrs"
Output:
"Apbqrs"
Round 2: Techno Managerial
✅ What are your skillsets, roles, and responsibilities in your current project?
✅ Consider a pipeline where you initially performed a full data load, but now you want to load data incrementally. How would you implement this change using Databricks?
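A hedged sketch of one common pattern: track a high-water mark on a last-modified column and MERGE only the newer rows into a Delta table instead of overwriting it. The paths, the order_id key, and the last_updated column are assumptions; Databricks Auto Loader or Delta change data feed are alternative building blocks for the same idea.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("incremental-load-sketch").getOrCreate()

target_path = "/mnt/curated/orders"                  # hypothetical Delta target
target = DeltaTable.forPath(spark, target_path)

# High-water mark: the newest timestamp already present in the target
# (assumes at least one prior full load, so the value is not null).
last_loaded = (spark.read.format("delta").load(target_path)
               .agg(F.max("last_updated")).collect()[0][0])

# Read only source rows that arrived after the watermark.
incremental = (spark.read.parquet("/mnt/raw/orders")          # hypothetical source
               .filter(F.col("last_updated") > F.lit(last_loaded)))

# Upsert the new/changed rows instead of reloading everything.
(target.alias("t")
       .merge(incremental.alias("s"), "t.order_id = s.order_id")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())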
✅ Scenario-based questions related to Spark.
✅ Scenario-based questions related to handling schema evolution.
✅ Write a SQL query to retrieve employees whose salary is greater than the average salary of their department (a sample query follows the table definitions).
Emp Table: empid, emp_name, salary, deptid
Dept Table: deptid, dept_name
Expected Output: empid, emp_name, salary, dept_name
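One way to write it, run through spark.sql so it stays in PySpark; it assumes an active SparkSession named spark and that emp and dept are available as tables or temp views.
result = spark.sql("""
    SELECT e.empid,
           e.emp_name,
           e.salary,
           d.dept_name
    FROM   emp  e
    JOIN   dept d ON e.deptid = d.deptid
    WHERE  e.salary > (SELECT AVG(e2.salary)      -- correlated per-department average
                       FROM   emp e2
                       WHERE  e2.deptid = e.deptid)
""")
result.show()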
✅ Write a SQL query to find the numbers that appear at least three times consecutively in a table (a sample query follows the expected output).
Log Table:
+----+-----+
| id | num |
+----+-----+
| 1  | 1   |
| 2  | 1   |
| 3  | 1   |
| 4  | 2   |
| 5  | 1   |
| 6  | 2   |
| 7  | 2   |
+----+-----+
Expected Output: 1
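One possible query, again via spark.sql (assuming the table above is registered as logs): compare each num with the two previous rows using LAG.
result = spark.sql("""
    SELECT DISTINCT num
    FROM (
        SELECT num,
               LAG(num, 1) OVER (ORDER BY id) AS prev_1,
               LAG(num, 2) OVER (ORDER BY id) AS prev_2
        FROM logs
    ) t
    WHERE num = prev_1 AND num = prev_2
""")
result.show()   # returns 1 for the sample data above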
✅ Explain how LEAD and LAG functions work in SQL with an example.
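A tiny self-contained illustration (hypothetical monthly revenue data): LAG pulls the value from the previous row and LEAD from the next row within the ordered window, which is handy for month-over-month comparisons.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lead-lag-demo").getOrCreate()

sales = spark.createDataFrame(
    [("2024-01", 100), ("2024-02", 120), ("2024-03", 90)],
    ["month", "revenue"])
sales.createOrReplaceTempView("sales")

spark.sql("""
    SELECT month,
           revenue,
           LAG(revenue)  OVER (ORDER BY month) AS prev_month_revenue,
           LEAD(revenue) OVER (ORDER BY month) AS next_month_revenue
    FROM sales
""").show()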
Round 3: HR
✅ Discussion around my experience and projects, plus some resume-based questions.
✅ What are you expecting in your next job role?
✅ How soon can you join the company, and what is your preferred location?